--- Credit Score in Data Analysis Course ---¶

I. Data Description¶

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

About Dataset¶

The dataset comprises information on 1000 customers, with 84 features derived from their financial transactions and current financial standing. The primary objective is to leverage this dataset for credit risk estimation and predicting potential defaults.

  • CUST_ID: Unique customer identifier

Key Target Variables:

  • CREDIT_SCORE: Numerical target variable representing the customer's credit score (integer)
  • DEFAULT: Binary target variable indicating if the customer has defaulted (1) or not (0)

Description of Features:

  • INCOME: Total income in the last 12 months

  • SAVINGS: Total savings in the last 12 months

  • DEBT: Total existing debt

  • R_SAVINGS_INCOME: Ratio of savings to income

  • R_DEBT_INCOME: Ratio of debt to income

  • R_DEBT_SAVINGS: Ratio of debt to savings

  • Transaction groups: expenditures are categorized into GROCERIES, CLOTHING, HOUSING, EDUCATION, HEALTH, TRAVEL, ENTERTAINMENT, GAMBLING, UTILITIES, TAX, and FINES.

  • T_{GROUP}_6: Total expenditure in that group in the last 6 months

  • T_{GROUP}_12: Total expenditure in that group in the last 12 months

  • R_{GROUP}: Ratio of T_{GROUP}_6 to T_{GROUP}_12

  • R_{GROUP}_INCOME: Ratio of T_{GROUP}_12 to INCOME

  • R_{GROUP}_SAVINGS: Ratio of T_{GROUP}_12 to SAVINGS

  • R_{GROUP}_DEBT: Ratio of T_{GROUP}_12 to DEBT
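To make the naming convention concrete, here is a small sketch (toy values, one hypothetical group) reconstructing two of these ratios by hand:

```python
import pandas as pd

# Toy row with invented values -- R_{GROUP} = T_{GROUP}_6 / T_{GROUP}_12,
# and R_{GROUP}_INCOME divides the 12-month total by INCOME
# (the SAVINGS and DEBT ratios follow the same pattern).
toy = pd.DataFrame({
    "INCOME": [50000.0],
    "T_GROCERIES_6": [1200.0],
    "T_GROCERIES_12": [2400.0],
})
toy["R_GROCERIES"] = toy["T_GROCERIES_6"] / toy["T_GROCERIES_12"]
toy["R_GROCERIES_INCOME"] = toy["T_GROCERIES_12"] / toy["INCOME"]
```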

Categorical Features:

  • CAT_GAMBLING: Gambling category (none, low, high)
  • CAT_DEBT: 1 if the customer has debt; 0 otherwise
  • CAT_CREDIT_CARD: 1 if the customer has a credit card; 0 otherwise
  • CAT_MORTGAGE: 1 if the customer has a mortgage; 0 otherwise
  • CAT_SAVINGS_ACCOUNT: 1 if the customer has a savings account; 0 otherwise
  • CAT_DEPENDENTS: 1 if the customer has any dependents; 0 otherwise

Summary:

  • Customer ID (string)
  • Financial numeric features (INCOME, SAVINGS, DEBT, …)
  • Ratios
  • Transaction amounts over 6 and 12 months
  • Binary categorical indicators (0/1)

Two target variables:

  • CREDIT_SCORE (regression)
  • DEFAULT (classification)

1. Loading Dataset¶

In [2]:
df = pd.read_csv(r"./credit_score.csv")
df
Out[2]:
CUST_ID INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING ... R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
0 C02COQEVYU 33269 0 532304 0.0000 16.0000 1.2000 1889 945 0.5003 ... 0.0000 0.0625 High 1 0 0 0 0 444 1
1 C02OZKC0ZF 77158 91187 315648 1.1818 4.0909 3.4615 5818 111 0.0191 ... 0.7692 0.2222 No 1 0 0 1 0 625 0
2 C03FHP2D0A 30917 21642 534864 0.7000 17.3000 24.7142 1157 860 0.7433 ... 1.4286 0.0578 High 1 0 0 1 0 469 1
3 C03PVPPHOY 80657 64526 629125 0.8000 7.8000 9.7499 6857 3686 0.5376 ... 1.2500 0.1282 High 1 0 0 1 0 559 0
4 C04J69MUX0 149971 1172498 2399531 7.8182 16.0000 2.0465 1978 322 0.1628 ... 0.1163 0.0568 High 1 1 1 1 1 473 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 CZQHJC9HDH 328892 1465066 5501471 4.4546 16.7273 3.7551 16701 10132 0.6067 ... 0.2041 0.0543 High 1 1 1 1 1 418 0
996 CZRA4MLB0P 81404 88805 680837 1.0909 8.3637 7.6667 5400 1936 0.3585 ... 0.8333 0.1087 No 1 0 0 1 0 589 1
997 CZSOD1KVFX 0 42428 30760 3.2379 8.1889 0.7250 0 0 0.8779 ... 0.2500 0.3448 No 1 0 0 1 0 499 0
998 CZWC76UAUT 36011 8002 604181 0.2222 16.7777 75.5037 1993 1271 0.6377 ... 5.0002 0.0662 No 1 1 0 1 0 507 0
999 CZZV5B3SAL 44266 309859 44266 6.9999 1.0000 0.1429 1574 1264 0.8030 ... 0.1587 1.1111 No 1 0 0 1 0 657 0

1000 rows × 87 columns

In [3]:
df.columns[1:5]
Out[3]:
Index(['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME'], dtype='object')

2. Target Variables¶

In [4]:
df[['DEFAULT', 'CREDIT_SCORE']]
Out[4]:
DEFAULT CREDIT_SCORE
0 1 444
1 0 625
2 1 469
3 0 559
4 0 473
... ... ...
995 0 418
996 1 589
997 0 499
998 0 507
999 0 657

1000 rows × 2 columns

In [5]:
print("Dataset information:")
df.info()
Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 87 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   CUST_ID                  1000 non-null   object 
 1   INCOME                   1000 non-null   int64  
 2   SAVINGS                  1000 non-null   int64  
 3   DEBT                     1000 non-null   int64  
 4   R_SAVINGS_INCOME         1000 non-null   float64
 5   R_DEBT_INCOME            1000 non-null   float64
 6   R_DEBT_SAVINGS           1000 non-null   float64
 7   T_CLOTHING_12            1000 non-null   int64  
 8   T_CLOTHING_6             1000 non-null   int64  
 9   R_CLOTHING               1000 non-null   float64
 10  R_CLOTHING_INCOME        1000 non-null   float64
 11  R_CLOTHING_SAVINGS       1000 non-null   float64
 12  R_CLOTHING_DEBT          1000 non-null   float64
 13  T_EDUCATION_12           1000 non-null   int64  
 14  T_EDUCATION_6            1000 non-null   int64  
 15  R_EDUCATION              1000 non-null   float64
 16  R_EDUCATION_INCOME       1000 non-null   float64
 17  R_EDUCATION_SAVINGS      1000 non-null   float64
 18  R_EDUCATION_DEBT         1000 non-null   float64
 19  T_ENTERTAINMENT_12       1000 non-null   int64  
 20  T_ENTERTAINMENT_6        1000 non-null   int64  
 21  R_ENTERTAINMENT          1000 non-null   float64
 22  R_ENTERTAINMENT_INCOME   1000 non-null   float64
 23  R_ENTERTAINMENT_SAVINGS  1000 non-null   float64
 24  R_ENTERTAINMENT_DEBT     1000 non-null   float64
 25  T_FINES_12               1000 non-null   int64  
 26  T_FINES_6                1000 non-null   int64  
 27  R_FINES                  1000 non-null   float64
 28  R_FINES_INCOME           1000 non-null   float64
 29  R_FINES_SAVINGS          1000 non-null   float64
 30  R_FINES_DEBT             1000 non-null   float64
 31  T_GAMBLING_12            1000 non-null   int64  
 32  T_GAMBLING_6             1000 non-null   int64  
 33  R_GAMBLING               1000 non-null   float64
 34  R_GAMBLING_INCOME        1000 non-null   float64
 35  R_GAMBLING_SAVINGS       1000 non-null   float64
 36  R_GAMBLING_DEBT          1000 non-null   float64
 37  T_GROCERIES_12           1000 non-null   int64  
 38  T_GROCERIES_6            1000 non-null   int64  
 39  R_GROCERIES              1000 non-null   float64
 40  R_GROCERIES_INCOME       1000 non-null   float64
 41  R_GROCERIES_SAVINGS      1000 non-null   float64
 42  R_GROCERIES_DEBT         1000 non-null   float64
 43  T_HEALTH_12              1000 non-null   int64  
 44  T_HEALTH_6               1000 non-null   int64  
 45  R_HEALTH                 1000 non-null   float64
 46  R_HEALTH_INCOME          1000 non-null   float64
 47  R_HEALTH_SAVINGS         1000 non-null   float64
 48  R_HEALTH_DEBT            1000 non-null   float64
 49  T_HOUSING_12             1000 non-null   int64  
 50  T_HOUSING_6              1000 non-null   int64  
 51  R_HOUSING                1000 non-null   float64
 52  R_HOUSING_INCOME         1000 non-null   float64
 53  R_HOUSING_SAVINGS        1000 non-null   float64
 54  R_HOUSING_DEBT           1000 non-null   float64
 55  T_TAX_12                 1000 non-null   int64  
 56  T_TAX_6                  1000 non-null   int64  
 57  R_TAX                    1000 non-null   float64
 58  R_TAX_INCOME             1000 non-null   float64
 59  R_TAX_SAVINGS            1000 non-null   float64
 60  R_TAX_DEBT               1000 non-null   float64
 61  T_TRAVEL_12              1000 non-null   int64  
 62  T_TRAVEL_6               1000 non-null   int64  
 63  R_TRAVEL                 1000 non-null   float64
 64  R_TRAVEL_INCOME          1000 non-null   float64
 65  R_TRAVEL_SAVINGS         1000 non-null   float64
 66  R_TRAVEL_DEBT            1000 non-null   float64
 67  T_UTILITIES_12           1000 non-null   int64  
 68  T_UTILITIES_6            1000 non-null   int64  
 69  R_UTILITIES              1000 non-null   float64
 70  R_UTILITIES_INCOME       1000 non-null   float64
 71  R_UTILITIES_SAVINGS      1000 non-null   float64
 72  R_UTILITIES_DEBT         1000 non-null   float64
 73  T_EXPENDITURE_12         1000 non-null   int64  
 74  T_EXPENDITURE_6          1000 non-null   int64  
 75  R_EXPENDITURE            1000 non-null   float64
 76  R_EXPENDITURE_INCOME     1000 non-null   float64
 77  R_EXPENDITURE_SAVINGS    1000 non-null   float64
 78  R_EXPENDITURE_DEBT       1000 non-null   float64
 79  CAT_GAMBLING             1000 non-null   object 
 80  CAT_DEBT                 1000 non-null   int64  
 81  CAT_CREDIT_CARD          1000 non-null   int64  
 82  CAT_MORTGAGE             1000 non-null   int64  
 83  CAT_SAVINGS_ACCOUNT      1000 non-null   int64  
 84  CAT_DEPENDENTS           1000 non-null   int64  
 85  CREDIT_SCORE             1000 non-null   int64  
 86  DEFAULT                  1000 non-null   int64  
dtypes: float64(51), int64(34), object(2)
memory usage: 679.8+ KB
In [6]:
# Check the data type of each column
print(df.dtypes)
CUST_ID                 object
INCOME                   int64
SAVINGS                  int64
DEBT                     int64
R_SAVINGS_INCOME       float64
                        ...   
CAT_MORTGAGE             int64
CAT_SAVINGS_ACCOUNT      int64
CAT_DEPENDENTS           int64
CREDIT_SCORE             int64
DEFAULT                  int64
Length: 87, dtype: object
In [7]:
# Count the number of columns of each data type
print(df.dtypes.value_counts())
float64    51
int64      34
object      2
Name: count, dtype: int64
In [8]:
# Summary statistics 
print("Summary Statistics:")
df.describe()
Summary Statistics:
Out[8]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_INCOME R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
count 1000.000000 1.000000e+03 1.000000e+03 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 ... 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000
mean 121610.019000 4.131896e+05 7.907180e+05 4.063477 6.068449 5.867252 6822.401000 3466.320000 0.454848 0.055557 ... 0.943607 0.913340 0.605276 0.944000 0.236000 0.173000 0.993000 0.15000 586.712000 0.284000
std 113716.699591 4.429160e+05 9.817904e+05 3.968097 5.847878 16.788356 7486.225932 5118.942977 0.236036 0.037568 ... 0.168989 1.625278 1.299382 0.230037 0.424835 0.378437 0.083414 0.35725 63.413882 0.451162
min 0.000000 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.003400 ... 0.666700 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 300.000000 0.000000
25% 30450.250000 5.971975e+04 5.396675e+04 1.000000 1.454500 0.206200 1084.500000 319.500000 0.263950 0.029700 ... 0.833300 0.158700 0.100000 1.000000 0.000000 0.000000 1.000000 0.00000 554.750000 0.000000
50% 85090.000000 2.738505e+05 3.950955e+05 2.545450 4.911550 2.000000 4494.000000 1304.000000 0.468850 0.046800 ... 0.909100 0.327950 0.178600 1.000000 0.000000 0.000000 1.000000 0.00000 596.000000 0.000000
75% 181217.500000 6.222600e+05 1.193230e+06 6.307100 8.587475 4.509600 10148.500000 4555.500000 0.626300 0.069400 ... 1.000000 0.833300 0.588200 1.000000 0.000000 0.000000 1.000000 0.00000 630.000000 1.000000
max 662094.000000 2.911863e+06 5.968620e+06 16.111200 37.000600 292.842100 43255.000000 39918.000000 1.058300 0.251700 ... 2.000200 10.009900 10.005300 1.000000 1.000000 1.000000 1.000000 1.00000 800.000000 1.000000

8 rows × 85 columns

II. Data Preprocessing¶

In [9]:
df = df.drop(columns=["CUST_ID"])

We drop CUST_ID since it is only an identifier, not a predictive feature.

1. Detecting Missing values and Duplicates¶

In [10]:
N_A = df.isna().sum().sort_values(ascending=False)
N_A, N_A.sum()
Out[10]:
(INCOME                 0
 SAVINGS                0
 DEBT                   0
 R_SAVINGS_INCOME       0
 R_DEBT_INCOME          0
                       ..
 CAT_MORTGAGE           0
 CAT_SAVINGS_ACCOUNT    0
 CAT_DEPENDENTS         0
 CREDIT_SCORE           0
 DEFAULT                0
 Length: 86, dtype: int64,
 np.int64(0))
In [11]:
# Duplicate rows
duplicate_rows = df[df.duplicated(keep=False)]

duplicate_count = df.duplicated().sum()
print("Total duplicates =", duplicate_count)
Total duplicates = 0

2. Numerical conversion¶

In [12]:
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print("Categorical columns:", cat_cols)
Categorical columns: ['CAT_GAMBLING']
In [13]:
df["CAT_GAMBLING"].value_counts()
Out[13]:
CAT_GAMBLING
No      620
High    264
Low     116
Name: count, dtype: int64
In [14]:
gambling = {'High':2,'Low':1,'No':0}
df["CAT_GAMBLING"] = df["CAT_GAMBLING"].map(gambling)
df.head()
Out[14]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
0 33269 0 532304 0.0000 16.0000 1.2000 1889 945 0.5003 0.0568 ... 0.0000 0.0625 2 1 0 0 0 0 444 1
1 77158 91187 315648 1.1818 4.0909 3.4615 5818 111 0.0191 0.0754 ... 0.7692 0.2222 0 1 0 0 1 0 625 0
2 30917 21642 534864 0.7000 17.3000 24.7142 1157 860 0.7433 0.0374 ... 1.4286 0.0578 2 1 0 0 1 0 469 1
3 80657 64526 629125 0.8000 7.8000 9.7499 6857 3686 0.5376 0.0850 ... 1.2500 0.1282 2 1 0 0 1 0 559 0
4 149971 1172498 2399531 7.8182 16.0000 2.0465 1978 322 0.1628 0.0132 ... 0.1163 0.0568 2 1 1 1 1 1 473 0

5 rows × 86 columns
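One caveat about the `.map()` call above: any category absent from the mapping dict becomes NaN silently, so it is worth confirming nothing was dropped. A minimal self-contained check on toy data:

```python
import pandas as pd

# Sketch: pandas .map() returns NaN for unmapped categories, so after an
# ordinal encoding like {'High': 2, 'Low': 1, 'No': 0} we verify coverage.
s = pd.Series(["High", "Low", "No", "No"])
mapped = s.map({"High": 2, "Low": 1, "No": 0})
assert mapped.isna().sum() == 0  # every category was covered
```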

3. Variance Thresholding¶

In [15]:
variances = df.var().sort_values()

plt.figure(figsize=(10, 4))
plt.plot(variances.values)
plt.title("Variance of Features")
plt.ylabel("Variance")
plt.show()
In [16]:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=1e-5)

original_columns = df.columns

# Note: .fit() only identifies low-variance columns; df itself is not modified.
selector.fit(df)
remaining_columns = df.columns[selector.get_support(indices=True)]

removed_columns = set(original_columns) - set(remaining_columns)

print("Removed columns:", removed_columns)
print("Remaining:", len(remaining_columns))
df.head()
Removed columns: {'R_HOUSING', 'R_UTILITIES', 'R_EDUCATION', 'R_FINES_INCOME', 'R_FINES_DEBT'}
Remaining: 81
Out[16]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
0 33269 0 532304 0.0000 16.0000 1.2000 1889 945 0.5003 0.0568 ... 0.0000 0.0625 2 1 0 0 0 0 444 1
1 77158 91187 315648 1.1818 4.0909 3.4615 5818 111 0.0191 0.0754 ... 0.7692 0.2222 0 1 0 0 1 0 625 0
2 30917 21642 534864 0.7000 17.3000 24.7142 1157 860 0.7433 0.0374 ... 1.4286 0.0578 2 1 0 0 1 0 469 1
3 80657 64526 629125 0.8000 7.8000 9.7499 6857 3686 0.5376 0.0850 ... 1.2500 0.1282 2 1 0 0 1 0 559 0
4 149971 1172498 2399531 7.8182 16.0000 2.0465 1978 322 0.1628 0.0132 ... 0.1163 0.0568 2 1 1 1 1 1 473 0

5 rows × 86 columns
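Note that `selector.fit(df)` only identifies the low-variance columns; `df` still has all 86 columns, as the `df.head()` output above shows. A minimal self-contained sketch (synthetic data) of actually dropping them:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic frame with one constant column to be flagged for removal.
demo = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [5.0, 5.0, 5.0, 5.0],   # zero variance
})
sel = VarianceThreshold(threshold=1e-5)
sel.fit(demo)                     # fitting alone does not modify demo
kept = demo.columns[sel.get_support()]
demo_reduced = demo[kept]         # the subset step does the actual dropping
```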

4. Outlier detection¶

In [17]:
num_cols = df.select_dtypes(include=['int64', 'float64']).columns

cols = num_cols[:8]

fig, axes = plt.subplots(2, 4, figsize=(20, 10))

for idx, col in enumerate(cols):
    row = idx // 4    
    col_pos = idx % 4  

    axes[row, col_pos].boxplot(df[col], vert=True, patch_artist=True,
                               boxprops=dict(facecolor='orange'))
    axes[row, col_pos].set_title(f"Box Plot for {col}")
    axes[row, col_pos].set_ylabel(col)

plt.tight_layout()
plt.show()

In this situation, we decide not to remove the outliers because they represent exactly the customers we want to detect.

For example, customers with extremely high debt or zero savings are high-risk cases.
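For reference, the standard 1.5×IQR rule behind the box-plot whiskers can be sketched as follows (invented values; consistent with the decision above, we only flag rather than remove):

```python
import pandas as pd

# 1.5*IQR rule on a toy series: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# are flagged as outliers.
s = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
flagged = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```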

Histogram distribution¶

In [18]:
plots_per_page = 12
cols_per_row = 4
rows_per_page = 3

for i in range(0, len(num_cols), plots_per_page):
    subset = num_cols[i:i+plots_per_page]

    fig, axes = plt.subplots(rows_per_page, cols_per_row, figsize=(20, 12))

    axes = axes.flatten()

    for j, col in enumerate(subset):
        axes[j].hist(df[col], bins=30, color='skyblue', edgecolor='black')
        axes[j].set_title(col)

    # Hide unused axes
    for k in range(len(subset), plots_per_page):
        axes[k].set_visible(False)

    plt.tight_layout()
    plt.show()

Feature Scaling¶

Feature groups for log-transformation:

In [19]:
# Group 1 — Heavy-tailed monetary variables
money_cols = ["INCOME", "SAVINGS", "DEBT"]

# Group 2 — All T_* expenditure features
t_cols = [col for col in df.columns if col.startswith("T_")]

# Group 3 — Extreme ratio variables (_SAVINGS, _DEBT)
extreme_ratio_cols = [col for col in df.columns 
                      if col.endswith("_SAVINGS") or col.endswith("_DEBT")]

# Group 4 — Moderately skewed (_INCOME ratios)
income_ratio_cols = [col for col in df.columns if col.endswith("_INCOME")]

# Group 5 — Balanced ratio variables (no transform)
balanced_ratio_cols = [
    col for col in df.columns
    if col.startswith("R_")
    and col not in extreme_ratio_cols
    and col not in income_ratio_cols
]

# Group 6 — Binary categorical variables
binary_cols = [col for col in df.columns if col.startswith("CAT_")]

# Group 7 — Targets
target_cols = ["CREDIT_SCORE", "DEFAULT"]
In [20]:
def plot_group(columns, title, df, cols_per_row=4):
    if len(columns) == 0:
        return
    
    rows = (len(columns) + cols_per_row - 1) // cols_per_row
    fig, axes = plt.subplots(rows, cols_per_row, figsize=(20, rows * 4))
    axes = axes.flatten()

    for i, col in enumerate(columns):
        axes[i].hist(df[col], bins=30, color='skyblue', edgecolor='black')
        axes[i].set_title(col)

    # Hide empty axes
    for j in range(len(columns), len(axes)):
        axes[j].set_visible(False)

    fig.suptitle(title, fontsize=20)
    plt.tight_layout()
    plt.show()
In [21]:
plot_group(money_cols, "Group 1 — Heavy-Tailed Monetary Variables\n", df)
plot_group(t_cols, "Group 2 — T_* Expenditure Variables\n", df)
plot_group(extreme_ratio_cols, "Group 3 — Extreme Ratio Variables (_SAVINGS & _DEBT)\n", df)
plot_group(income_ratio_cols, "Group 4 — Moderately Skewed Income Ratios\n", df)
plot_group(balanced_ratio_cols, "Group 5 — Balanced Ratio Variables\n", df)
plot_group(binary_cols, "Group 6 — Binary Variables\n", df)
plot_group(target_cols, "Group 7 — Target Variables\n", df)
In [22]:
# 1. Heavy monetary variables
must_transform = set(money_cols)

# 2. All T_* expenditure variables
must_transform.update(t_cols)

# 3. Extreme *_SAVINGS and *_DEBT ratios (exclude binary CAT_DEBT)
must_transform.update([c for c in extreme_ratio_cols if c != "CAT_DEBT"])

# 4. Strongly skewed income ratios (subset of income_ratio_cols)
strong_income_transform = [
    "R_SAVINGS_INCOME", "R_DEBT_INCOME", "R_EDUCATION_INCOME",
    "R_ENTERTAINMENT_INCOME", "R_GAMBLING_INCOME",
    "R_HEALTH_INCOME", "R_TAX_INCOME", "R_TRAVEL_INCOME"
]

must_transform.update(strong_income_transform)

Does the transformation really help PCA? Yes: PCA is driven by variance, so without it the heavy-tailed features would dominate the principal components; reducing skew also brings the features closer to the roughly symmetric distributions PCA handles best.

5. Log transformation¶

  • Log-transform fixes skew
  • Standardization fixes scale
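Why `log1p` rather than plain `log`: a quick self-contained illustration. INCOME, SAVINGS and DEBT contain zeros, where `np.log` would return `-inf`, while `np.log1p` stays finite and is exactly invertible with `np.expm1`:

```python
import numpy as np

# np.log1p(x) = log(1 + x) is finite at x = 0, unlike np.log(x);
# np.expm1 is its exact inverse.
vals = np.array([0.0, 9.0, 99.0])
transformed = np.log1p(vals)      # [0, log(10), log(100)]
recovered = np.expm1(transformed)
```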
In [23]:
# Apply log1p transform
df_log = df.copy()
df_log[list(must_transform)] = np.log1p(df[list(must_transform)])

def compare_hist(col):
    fig, axes = plt.subplots(1, 2, figsize=(12,4))
    
    axes[0].hist(df[col], bins=40, color='skyblue', edgecolor='black')
    axes[0].set_title(f"{col} — BEFORE")

    axes[1].hist(df_log[col], bins=40, color='salmon', edgecolor='black')
    axes[1].set_title(f"{col} — AFTER log1p")

    plt.tight_layout()
    plt.show()

for col in ["INCOME", "T_EXPENDITURE_12", "R_DEBT_INCOME", "R_FINES_SAVINGS"]:
    compare_hist(col)

All four distributions become noticeably more symmetric.

In [24]:
df_original = df.copy()

df[list(must_transform)] = np.log1p(df[list(must_transform)])
df.shape
Out[24]:
(1000, 86)

Note: we keep df_original so we can come back later and compare results between the original and the transformed data.

6. Correlation Analysis¶

We do the correlation analysis before scaling (Pearson correlation is invariant to standardization, so the order does not change the result).

In [25]:
pearson_corr = df.corr(method='pearson')

plt.figure(figsize=(16, 12))
sns.heatmap(pearson_corr, cmap='coolwarm', center=0)
plt.title("Pearson Correlation Heatmap", fontsize=20)
plt.show()
In [26]:
# First 5 columns
plt.figure(figsize=(6, 4))
sns.heatmap(pearson_corr.iloc[:5, :5], annot=True)
plt.show()

Highly correlated pairs

In [27]:
threshold = 0.85

high_corr_pairs = (
    pearson_corr
    .where(np.triu(np.ones(pearson_corr.shape), k=1).astype(bool))
    .stack()
    .reset_index()
)

high_corr_pairs.columns = ['Feature_1', 'Feature_2', 'Correlation']

# Filter by threshold and remove old index numbers
high_corr_pairs = (
    high_corr_pairs[high_corr_pairs['Correlation'].abs() >= threshold]
    .sort_values(by="Correlation", ascending=False)
    .reset_index(drop=True)
)

print(high_corr_pairs[1:5])
        Feature_1      Feature_2  Correlation
1  T_EDUCATION_12  T_EDUCATION_6     0.999907
2   T_GAMBLING_12   T_GAMBLING_6     0.999660
3  T_UTILITIES_12  T_UTILITIES_6     0.999219
4        T_TAX_12        T_TAX_6     0.999158

PCA will deal with this multicollinearity: highly correlated features collapse into shared components.
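A minimal sketch (synthetic data, not the course dataset) of how a near-duplicate feature pair collapses into a single dominant principal component:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two nearly identical features (correlation ~1); after standardization,
# the first principal component absorbs almost all of their variance.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, x + 0.01 * rng.normal(size=500)])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize first
pca = PCA().fit(X)
# pca.explained_variance_ratio_[0] is close to 1.0
```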

III. EDA¶

Target Variable Analysis¶

In [28]:
# Select and count the numerical columns and the categorical/object columns

object_cols = df.select_dtypes(include=['object']).columns.tolist()
number_object = len(object_cols)

numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
number_numerical = len(numeric_cols)

print(f"\nObject columns in the dataset: {object_cols} \n==> count = {number_object}")
print(f"Numerical columns in the dataset: {numeric_cols} \n==> count = {number_numerical}")
Object columns in the dataset: [] 
==> count = 0
Numerical columns in the dataset: ['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME', 'R_DEBT_INCOME', 'R_DEBT_SAVINGS', 'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING', 'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT', 'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME', 'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT', 'T_ENTERTAINMENT_12', 'T_ENTERTAINMENT_6', 'R_ENTERTAINMENT', 'R_ENTERTAINMENT_INCOME', 'R_ENTERTAINMENT_SAVINGS', 'R_ENTERTAINMENT_DEBT', 'T_FINES_12', 'T_FINES_6', 'R_FINES', 'R_FINES_INCOME', 'R_FINES_SAVINGS', 'R_FINES_DEBT', 'T_GAMBLING_12', 'T_GAMBLING_6', 'R_GAMBLING', 'R_GAMBLING_INCOME', 'R_GAMBLING_SAVINGS', 'R_GAMBLING_DEBT', 'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME', 'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT', 'T_HEALTH_12', 'T_HEALTH_6', 'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT', 'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME', 'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT', 'T_TAX_12', 'T_TAX_6', 'R_TAX', 'R_TAX_INCOME', 'R_TAX_SAVINGS', 'R_TAX_DEBT', 'T_TRAVEL_12', 'T_TRAVEL_6', 'R_TRAVEL', 'R_TRAVEL_INCOME', 'R_TRAVEL_SAVINGS', 'R_TRAVEL_DEBT', 'T_UTILITIES_12', 'T_UTILITIES_6', 'R_UTILITIES', 'R_UTILITIES_INCOME', 'R_UTILITIES_SAVINGS', 'R_UTILITIES_DEBT', 'T_EXPENDITURE_12', 'T_EXPENDITURE_6', 'R_EXPENDITURE', 'R_EXPENDITURE_INCOME', 'R_EXPENDITURE_SAVINGS', 'R_EXPENDITURE_DEBT', 'CAT_GAMBLING', 'CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS', 'CREDIT_SCORE', 'DEFAULT'] 
==> count = 86
In [29]:
# Distribution of CREDIT_SCORE
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram with KDE
sns.histplot(df['CREDIT_SCORE'], kde=True, ax=axes[0], color='steelblue')
axes[0].set_title('Distribution of CREDIT_SCORE')
axes[0].axvline(df['CREDIT_SCORE'].mean(), color='red', linestyle='--', label=f"Mean: {df['CREDIT_SCORE'].mean():.1f}")
axes[0].axvline(df['CREDIT_SCORE'].median(), color='green', linestyle='--', label=f"Median: {df['CREDIT_SCORE'].median():.1f}")
axes[0].legend()

# Box plot
sns.boxplot(y=df['CREDIT_SCORE'], ax=axes[1], color='steelblue')
axes[1].set_title('Box Plot of CREDIT_SCORE')

# CREDIT_SCORE by DEFAULT
sns.boxplot(x='DEFAULT', y='CREDIT_SCORE', data=df, ax=axes[2], hue='DEFAULT', palette='Set2', legend=False)
axes[2].set_title('CREDIT_SCORE by DEFAULT Status')
axes[2].set_xticks([0, 1])
axes[2].set_xticklabels(['No Default (0)', 'Default (1)'])

plt.tight_layout()
plt.show()

# Statistics
print("CREDIT_SCORE Statistics:")
print(f"  Skewness: {df['CREDIT_SCORE'].skew():.3f}")
print(f"  Kurtosis: {df['CREDIT_SCORE'].kurtosis():.3f}")
print(f"\nCREDIT_SCORE by DEFAULT:")
print(df.groupby('DEFAULT')['CREDIT_SCORE'].describe())
CREDIT_SCORE Statistics:
  Skewness: -1.005
  Kurtosis: 2.033

CREDIT_SCORE by DEFAULT:
         count        mean        std    min     25%    50%     75%    max
DEFAULT                                                                   
0        716.0  600.008380  53.299627  385.0  566.00  606.5  635.25  800.0
1        284.0  553.190141  73.776756  300.0  516.75  565.5  609.00  691.0
In [30]:
# Credit Score Segmentation
def credit_score_category(score):
    if score < 500:
        return 'Poor'
    elif score < 600:
        return 'Fair'
    elif score < 700:
        return 'Good'
    else:
        return 'Excellent'

df['CREDIT_CATEGORY'] = df['CREDIT_SCORE'].apply(credit_score_category)

# Analysis by category
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count by category
category_order = ['Poor', 'Fair', 'Good', 'Excellent']
sns.countplot(x='CREDIT_CATEGORY', data=df, order=category_order, ax=axes[0], hue='CREDIT_CATEGORY', palette='RdYlGn', legend=False)
axes[0].set_title('Distribution by Credit Score Category')
axes[0].set_xlabel('Credit Category')

# Default rate by category
default_rate = df.groupby('CREDIT_CATEGORY')['DEFAULT'].mean().reindex(category_order)
default_rate.plot(kind='bar', ax=axes[1], color=['red', 'orange', 'yellowgreen', 'green'])
axes[1].set_title('Default Rate by Credit Score Category')
axes[1].set_ylabel('Default Rate')
axes[1].set_xticklabels(category_order, rotation=0)

plt.tight_layout()
plt.show()

# Cross-tabulation
print("Cross-tabulation: CREDIT_CATEGORY vs DEFAULT")
print(pd.crosstab(df['CREDIT_CATEGORY'], df['DEFAULT'], margins=True, normalize='index').round(3))
Cross-tabulation: CREDIT_CATEGORY vs DEFAULT
DEFAULT              0      1
CREDIT_CATEGORY              
Excellent        1.000  0.000
Fair             0.677  0.323
Good             0.814  0.186
Poor             0.333  0.667
All              0.716  0.284
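The default rates above come from `groupby(...)['DEFAULT'].mean()`: the mean of a 0/1 indicator within each group is exactly that group's rate. A toy illustration (invented categories and values):

```python
import pandas as pd

# Mean of a binary indicator per group == group-level rate.
toy = pd.DataFrame({"cat": ["Poor", "Poor", "Good", "Good"],
                    "default": [1, 0, 0, 0]})
rates = toy.groupby("cat")["default"].mean()
```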

Bivariate Analysis - Features vs Target Variables¶

In [31]:
# Top correlated features with CREDIT_SCORE
credit_corr = df[numeric_cols].corr()['CREDIT_SCORE'].drop('CREDIT_SCORE').sort_values(key=abs, ascending=False)

print("Top 15 features correlated with CREDIT_SCORE:")
print(credit_corr.head(15))
print("\n" + "="*50)

# Top correlated features with DEFAULT
default_corr = df[numeric_cols].corr()['DEFAULT'].drop('DEFAULT').sort_values(key=abs, ascending=False)

print("\nTop 15 features correlated with DEFAULT:")
print(default_corr.head(15))
Top 15 features correlated with CREDIT_SCORE:
R_DEBT_INCOME          -0.757480
R_DEBT_SAVINGS         -0.625888
R_EXPENDITURE_DEBT      0.453632
R_ENTERTAINMENT_DEBT    0.350135
R_GROCERIES_DEBT        0.348084
DEBT                   -0.338848
R_UTILITIES_DEBT        0.333114
DEFAULT                -0.333092
R_TRAVEL_DEBT           0.314333
R_TAX_DEBT              0.304005
R_HEALTH_DEBT           0.301370
R_EDUCATION_INCOME     -0.300259
R_CLOTHING_DEBT         0.280410
R_HOUSING_DEBT          0.270634
R_SAVINGS_INCOME        0.241467
Name: CREDIT_SCORE, dtype: float64

==================================================

Top 15 features correlated with DEFAULT:
CREDIT_SCORE           -0.333092
R_DEBT_INCOME           0.250579
R_DEBT_SAVINGS          0.197758
R_EXPENDITURE_DEBT     -0.177296
R_ENTERTAINMENT_DEBT   -0.158330
R_TAX_DEBT             -0.142238
R_GROCERIES_DEBT       -0.140743
R_UTILITIES_DEBT       -0.138511
R_HEALTH_DEBT          -0.137009
R_TRAVEL_DEBT          -0.132935
R_CLOTHING_DEBT        -0.128037
R_GROCERIES_SAVINGS     0.121917
CAT_CREDIT_CARD         0.119993
R_EDUCATION_SAVINGS     0.108363
R_UTILITIES_SAVINGS     0.099182
Name: DEFAULT, dtype: float64
In [32]:
# Visualize top correlated features
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Top 10 for CREDIT_SCORE
top_credit = credit_corr.head(10)
colors = ['green' if x > 0 else 'red' for x in top_credit.values]
top_credit.plot(kind='barh', ax=axes[0], color=colors)
axes[0].set_title('Top 10 Features Correlated with CREDIT_SCORE')
axes[0].set_xlabel('Correlation')

# Top 10 for DEFAULT
top_default = default_corr.head(10)
colors = ['green' if x > 0 else 'red' for x in top_default.values]
top_default.plot(kind='barh', ax=axes[1], color=colors)
axes[1].set_title('Top 10 Features Correlated with DEFAULT')
axes[1].set_xlabel('Correlation')

plt.tight_layout()
plt.show()
In [33]:
# Scatter plots for top 6 features vs CREDIT_SCORE
top_features = credit_corr.head(6).index.tolist()

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

for i, feature in enumerate(top_features):
    axes[i].scatter(df[feature], df['CREDIT_SCORE'], alpha=0.5, c=df['DEFAULT'], cmap='coolwarm')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('CREDIT_SCORE')
    axes[i].set_title(f'{feature} vs CREDIT_SCORE (r={credit_corr[feature]:.3f})')

plt.tight_layout()
plt.show()
In [34]:
# Box plots for top 6 features by DEFAULT status
top_default_features = default_corr.head(6).index.tolist()

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

for i, feature in enumerate(top_default_features):
    sns.boxplot(x='DEFAULT', y=feature, hue='DEFAULT', data=df, ax=axes[i], palette='Set2', legend=False)
    axes[i].set_title(f'{feature} by DEFAULT (r={default_corr[feature]:.3f})')
    axes[i].set_xticks([0, 1])
    axes[i].set_xticklabels(['No Default', 'Default'])

plt.tight_layout()
plt.show()

Categorical Analysis - CAT_GAMBLING vs Targets¶

In [35]:
# # CAT_GAMBLING analysis
# from scipy.stats import chi2_contingency

# fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# # Count by gambling category
# gambling_order = ['none', 'low', 'high']  # category values per the data description
# sns.countplot(x='CAT_GAMBLING', hue='DEFAULT', data=df, order=gambling_order, ax=axes[0], palette='Set2')
# axes[0].set_title('DEFAULT by CAT_GAMBLING')
# axes[0].legend(title='DEFAULT', labels=['No Default', 'Default'])

# # Default rate by gambling category
# default_by_gambling = df.groupby('CAT_GAMBLING')['DEFAULT'].mean().reindex(gambling_order)
# default_by_gambling.plot(kind='bar', ax=axes[1], color=['green', 'orange', 'red'])
# axes[1].set_title('Default Rate by CAT_GAMBLING')
# axes[1].set_ylabel('Default Rate')
# axes[1].set_xticklabels(gambling_order, rotation=0)

# # Credit score by gambling category
# sns.boxplot(x='CAT_GAMBLING', y='CREDIT_SCORE', data=df, order=gambling_order, ax=axes[2], palette='RdYlGn_r')
# axes[2].set_title('CREDIT_SCORE by CAT_GAMBLING')

# plt.tight_layout()
# plt.show()

# # Chi-square test
# contingency_table = pd.crosstab(df['CAT_GAMBLING'], df['DEFAULT'])
# chi2, p_value, dof, expected = chi2_contingency(contingency_table)

# print("Cross-tabulation: CAT_GAMBLING vs DEFAULT")
# print(pd.crosstab(df['CAT_GAMBLING'], df['DEFAULT'], margins=True))
# print(f"\nChi-square test: chi2={chi2:.3f}, p-value={p_value:.4f}")
# print(f"Statistically significant: {'Yes' if p_value < 0.05 else 'No'}")

# print("\n" + "="*50)
# print("\nMean CREDIT_SCORE by CAT_GAMBLING:")
# print(df.groupby('CAT_GAMBLING')['CREDIT_SCORE'].agg(['mean', 'median', 'std']).round(2))
In [36]:
# Analysis of binary categorical features vs DEFAULT
binary_cats = ['CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

for i, cat in enumerate(binary_cats):
    # Default rate by category
    default_rate = df.groupby(cat)['DEFAULT'].mean()
    default_rate.plot(kind='bar', ax=axes[i], color=['steelblue', 'coral'])
    axes[i].set_title(f'Default Rate by {cat}')
    axes[i].set_ylabel('Default Rate')
    axes[i].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)
    
    # Add values on bars
    for j, v in enumerate(default_rate.values):
        axes[i].text(j, v + 0.01, f'{v:.2%}', ha='center')

axes[5].set_visible(False)
plt.tight_layout()
plt.show()

# Summary statistics
print("Default Rate by Binary Categorical Features:")
for cat in binary_cats:
    rate = df.groupby(cat)['DEFAULT'].mean()
    print(f"\n{cat}:")
    print(f"  No (0): {rate.get(0, 0):.2%}")
    print(f"  Yes (1): {rate.get(1, 0):.2%}")
Default Rate by Binary Categorical Features:

CAT_DEBT:
  No (0): 21.43%
  Yes (1): 28.81%

CAT_CREDIT_CARD:
  No (0): 25.39%
  Yes (1): 38.14%

CAT_MORTGAGE:
  No (0): 27.57%
  Yes (1): 32.37%

CAT_SAVINGS_ACCOUNT:
  No (0): 42.86%
  Yes (1): 28.30%

CAT_DEPENDENTS:
  No (0): 27.18%
  Yes (1): 35.33%
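The rate gaps above can be checked for statistical significance with a chi-square test on the contingency table of a binary feature against DEFAULT. A minimal sketch on toy data (the counts below are made up for illustration, not taken from the dataset):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for one binary feature vs DEFAULT
toy = pd.DataFrame({
    'CAT_CREDIT_CARD': [0] * 60 + [1] * 40,
    'DEFAULT':         [0] * 45 + [1] * 15 + [0] * 20 + [1] * 20,
})

# 2x2 contingency table: feature value (rows) vs default status (columns)
table = pd.crosstab(toy['CAT_CREDIT_CARD'], toy['DEFAULT'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p:.4f}, dof={dof}")
```

A small p-value (below 0.05) would indicate the default rate genuinely differs between the two groups rather than by chance.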

Feature Correlation Analysis - Identifying Redundant Features¶

In [37]:
# Find highly correlated feature pairs (potential redundancy)
corr_matrix = df[numeric_cols].corr().abs()

# Get upper triangle
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find pairs with correlation > 0.9
high_corr_pairs = []
for col in upper_tri.columns:
    for idx in upper_tri.index:
        if upper_tri.loc[idx, col] > 0.9:
            high_corr_pairs.append((idx, col, upper_tri.loc[idx, col]))

# Sort by correlation
high_corr_pairs.sort(key=lambda x: x[2], reverse=True)

print(f"Found {len(high_corr_pairs)} highly correlated feature pairs (r > 0.9):")
print("="*70)
for feat1, feat2, corr_val in high_corr_pairs[:20]:  # Show top 20
    print(f"{feat1} <-> {feat2}: {corr_val:.3f}")
Found 20 highly correlated feature pairs (r > 0.9):
======================================================================
T_HOUSING_12 <-> T_HOUSING_6: 1.000
T_EDUCATION_12 <-> T_EDUCATION_6: 1.000
T_GAMBLING_12 <-> T_GAMBLING_6: 1.000
T_UTILITIES_12 <-> T_UTILITIES_6: 0.999
T_TAX_12 <-> T_TAX_6: 0.999
T_ENTERTAINMENT_12 <-> T_ENTERTAINMENT_6: 0.998
T_GROCERIES_12 <-> T_GROCERIES_6: 0.996
T_TRAVEL_12 <-> T_TRAVEL_6: 0.991
T_EXPENDITURE_12 <-> T_EXPENDITURE_6: 0.989
T_GAMBLING_6 <-> CAT_GAMBLING: 0.982
T_GAMBLING_12 <-> CAT_GAMBLING: 0.982
INCOME <-> T_ENTERTAINMENT_12: 0.965
T_HEALTH_12 <-> T_UTILITIES_12: 0.960
T_HEALTH_12 <-> T_UTILITIES_6: 0.959
INCOME <-> T_ENTERTAINMENT_6: 0.958
R_GROCERIES_DEBT <-> R_UTILITIES_DEBT: 0.943
INCOME <-> T_CLOTHING_12: 0.940
T_CLOTHING_12 <-> T_ENTERTAINMENT_12: 0.914
T_CLOTHING_12 <-> T_ENTERTAINMENT_6: 0.913
T_GROCERIES_12 <-> T_EXPENDITURE_12: 0.901
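A common follow-up is to drop one feature from each highly correlated pair before modeling. A minimal sketch of that pruning step on a toy DataFrame (not the course dataset), using the same upper-triangle trick as above:

```python
import numpy as np
import pandas as pd

# Toy frame: 'b' is a linear copy of 'a' (r = 1), 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a, 'b': 2 * a + 1, 'c': rng.normal(size=200)})

corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop any column that is > 0.9 correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)  # ['b']
```

Which member of a pair to drop is a judgment call; keeping the more interpretable feature (e.g. the 12-month total over the 6-month one) is a reasonable default.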

Principal component analysis (PCA)¶

Robust_scale.png

This robust scaling centers each feature on its median and scales by the interquartile range (IQR), so it can handle extreme values (outliers).
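Because the median and IQR ignore the tails, a single extreme value barely affects the transform. A minimal sketch on toy data (not the course dataset):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One extreme outlier among small values
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

robust = RobustScaler().fit(x)

# center_ is the median (3.0), scale_ is the IQR (4.0 - 2.0 = 2.0);
# the outlier does not distort either statistic
print(robust.center_, robust.scale_)
print(robust.transform(x).ravel())  # inliers map to -1, -0.5, 0, 0.5
```

A StandardScaler fitted on the same data would be dominated by the outlier, since the mean and standard deviation are both pulled toward 1000.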

In [38]:
from sklearn.preprocessing import RobustScaler

x = df.drop(columns=['CREDIT_SCORE', 'DEFAULT', 'CREDIT_CATEGORY'])
y_class = df['DEFAULT']
y_reg = df['CREDIT_SCORE']
y_seg = df['CREDIT_CATEGORY']

scaler = RobustScaler()
x_scaled = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)

Fit PCA on x_scaled

In [39]:
from sklearn.decomposition import PCA

pca = PCA()                   
pca.fit(x_scaled)
Out[39]:
PCA()
In [40]:
plt.figure(figsize=(10,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("PCA Scree Plot")
plt.grid(True)
plt.show()

Based on PCA scree plot:

  • The first component (PC1) captures ~50% of the variance
  • By 5 components, we already capture more than 80% of the variance
  • By 10 components, the cumulative explained variance reaches ~90% and the curve starts to flatten. Therefore, we keep 10 principal components.
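The same cutoff can be computed programmatically instead of read off the plot: take the smallest number of components whose cumulative explained variance crosses the chosen threshold (0.90 here, matching the choice above). A sketch on synthetic data standing in for x_scaled:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with a few dominant directions (stand-in for x_scaled)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20)) * np.linspace(5, 0.1, 20)

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 90%
k = int(np.argmax(cumvar >= 0.90)) + 1
print(k, cumvar[k - 1])
```

Equivalently, `PCA(n_components=0.90)` lets scikit-learn pick the number of components for a target variance fraction directly.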
In [41]:
n_components = 10

Fit PCA with the chosen number of components

In [42]:
pca = PCA(n_components=n_components)
x_pca = pca.fit_transform(x_scaled)
In [43]:
pca.explained_variance_ratio_
Out[43]:
array([0.49960918, 0.12467667, 0.10535623, 0.05484544, 0.03530568,
       0.026041  , 0.02092693, 0.0187986 , 0.01367969, 0.01269902])
In [44]:
pca.explained_variance_ratio_.sum()
Out[44]:
np.float64(0.9119384457104507)

The loadings matrix shows which original variables contribute the most to each PC (the tables below rank by absolute loading, so signs are not shown).

In [45]:
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(n_components)],
    index=x.columns
)
In [46]:
pc_tables = {}

for i in range(10):
    pc_name = f'PC{i+1}'
    top_vals = loadings[pc_name].abs().sort_values(ascending=False).head(10)
    pc_tables[pc_name] = pd.DataFrame({
        'Variable': top_vals.index,
        'Loading Value': top_vals.values
    })

for pc, table in pc_tables.items():
    print(f"\n==== {pc} ====")
    display(table)
==== PC1 ====
Variable Loading Value
0 R_GAMBLING_SAVINGS 0.977239
1 R_EDUCATION_SAVINGS 0.088294
2 R_HEALTH_SAVINGS 0.082452
3 R_GAMBLING_INCOME 0.076346
4 R_GAMBLING_DEBT 0.066045
5 R_CLOTHING_SAVINGS 0.059961
6 R_UTILITIES_SAVINGS 0.058493
7 R_GROCERIES_SAVINGS 0.057136
8 R_ENTERTAINMENT_SAVINGS 0.037196
9 R_HEALTH_DEBT 0.032289
==== PC2 ====
Variable Loading Value
0 R_EDUCATION_SAVINGS 0.830026
1 R_EDUCATION_DEBT 0.388140
2 R_GROCERIES_SAVINGS 0.165471
3 R_EDUCATION_INCOME 0.143035
4 R_HEALTH_SAVINGS 0.121223
5 R_GAMBLING_SAVINGS 0.117376
6 R_ENTERTAINMENT_SAVINGS 0.115602
7 R_UTILITIES_SAVINGS 0.096593
8 R_EXPENDITURE_SAVINGS 0.080639
9 R_TRAVEL_SAVINGS 0.073878
==== PC3 ====
Variable Loading Value
0 R_HEALTH_DEBT 0.430235
1 R_CLOTHING_DEBT 0.352193
2 R_TAX_DEBT 0.329979
3 R_EDUCATION_DEBT 0.328262
4 R_UTILITIES_DEBT 0.318823
5 R_HOUSING_DEBT 0.318532
6 R_ENTERTAINMENT_DEBT 0.268869
7 R_GROCERIES_DEBT 0.262130
8 R_TRAVEL_DEBT 0.192121
9 R_EXPENDITURE_DEBT 0.151929
==== PC4 ====
Variable Loading Value
0 R_EDUCATION_DEBT 0.728950
1 R_HOUSING_DEBT 0.325328
2 R_HEALTH_DEBT 0.247563
3 R_EDUCATION_SAVINGS 0.236476
4 R_TAX_DEBT 0.174421
5 INCOME 0.109946
6 R_GROCERIES_SAVINGS 0.106388
7 R_ENTERTAINMENT_SAVINGS 0.098834
8 T_TAX_12 0.098571
9 T_TAX_6 0.093652
==== PC5 ====
Variable Loading Value
0 R_EDUCATION_SAVINGS 0.374555
1 R_ENTERTAINMENT_SAVINGS 0.280735
2 R_EDUCATION_DEBT 0.239342
3 T_HEALTH_12 0.239282
4 INCOME 0.236393
5 R_TRAVEL_SAVINGS 0.220699
6 T_UTILITIES_12 0.205876
7 T_UTILITIES_6 0.194033
8 R_HEALTH_DEBT 0.182444
9 T_HEALTH_6 0.181538
==== PC6 ====
Variable Loading Value
0 R_HEALTH_SAVINGS 0.415204
1 R_UTILITIES_SAVINGS 0.290061
2 R_CLOTHING_SAVINGS 0.284561
3 R_GROCERIES_SAVINGS 0.280974
4 R_HOUSING_SAVINGS 0.259285
5 R_CLOTHING_DEBT 0.253029
6 R_TAX_DEBT 0.231399
7 R_HOUSING_DEBT 0.199283
8 R_ENTERTAINMENT_SAVINGS 0.194951
9 R_TAX_SAVINGS 0.171425
==== PC7 ====
Variable Loading Value
0 T_TRAVEL_12 0.447831
1 T_TRAVEL_6 0.328668
2 R_HOUSING_DEBT 0.326502
3 R_HOUSING_SAVINGS 0.242190
4 R_TRAVEL_SAVINGS 0.237960
5 R_HEALTH_SAVINGS 0.211043
6 R_TRAVEL_INCOME 0.210480
7 R_UTILITIES_SAVINGS 0.188626
8 R_CLOTHING_DEBT 0.182161
9 R_EDUCATION_DEBT 0.180139
==== PC8 ====
Variable Loading Value
0 R_HOUSING_DEBT 0.714689
1 T_TRAVEL_12 0.233464
2 R_HEALTH_DEBT 0.199545
3 INCOME 0.190557
4 R_CLOTHING_DEBT 0.168073
5 T_TRAVEL_6 0.167848
6 T_ENTERTAINMENT_12 0.160307
7 R_HOUSING_SAVINGS 0.157005
8 T_ENTERTAINMENT_6 0.151220
9 R_EDUCATION_DEBT 0.143039
==== PC9 ====
Variable Loading Value
0 R_TAX 0.520399
1 T_TRAVEL_12 0.300750
2 R_HEALTH_SAVINGS 0.271992
3 R_GAMBLING_DEBT 0.228584
4 R_HEALTH_DEBT 0.227304
5 T_TRAVEL_6 0.225651
6 R_TAX_DEBT 0.213570
7 R_ENTERTAINMENT_SAVINGS 0.196636
8 R_ENTERTAINMENT_INCOME 0.162187
9 R_GAMBLING_INCOME 0.154139
==== PC10 ====
Variable Loading Value
0 R_TAX 0.781085
1 T_TRAVEL_12 0.298051
2 T_TRAVEL_6 0.222763
3 R_HEALTH_SAVINGS 0.169571
4 R_ENTERTAINMENT_INCOME 0.141080
5 R_UTILITIES_SAVINGS 0.139574
6 R_HOUSING_SAVINGS 0.136124
7 R_ENTERTAINMENT_SAVINGS 0.128237
8 R_TRAVEL_DEBT 0.116832
9 R_TRAVEL_INCOME 0.115618

Interpretation for each PC¶

  • PC1: Gambling vs Saving Intensity
  • PC2: Education Spending vs Savings Capacity
  • PC3: Debt Structure Across All Spending Categories
  • PC4: Education & Housing Debt vs Income
  • PC5: Health & Utilities Spending + Education Behavior
  • PC6: Savings Ratios Across Categories
  • PC7: Travel Expenditure Pattern + Housing Debt
  • PC8: Housing Debt & Travel Spending Mix
  • PC9: Tax Behavior and Travel & Health Spending
  • PC10: Tax Burden and Travel Pattern

2D PCA Scatter Plot¶

In [47]:
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y_seg.map({'Poor':0,'Fair':1,'Good':2,'Excellent':3}), cmap='viridis')
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter Plot by Credit Category")
plt.colorbar()
plt.show()
In [48]:
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:,0], x_pca[:,1], c=y_class, cmap='coolwarm', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter by DEFAULT")
plt.colorbar()
plt.show()
In [49]:
plt.figure(figsize=(10,7))
plt.scatter(x_pca[:,0], x_pca[:,1], c=y_reg, cmap='plasma', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D Scatter by CREDIT_SCORE")
plt.colorbar()
plt.show()

The PCA plot (PC1 vs PC2) shows that the four credit categories, the default labels, and the credit scores all overlap heavily: there are no clear clusters and no obvious regression direction. The main directions of variation in the data are driven mostly by spending and saving behavior, which does not align with the credit categories.

Since PCA is unsupervised and focuses on overall variance, it is normal that the credit groups are not visually separated in the first two components.

With the PCA-transformed data, we can also work toward a new task: Customer Financial Behavior Segmentation.

PCA-based Dataset¶

In [50]:
df_pca = pd.DataFrame(x_pca, columns=[f'PC{i+1}' for i in range(10)])
df_pca["DEFAULT"] = y_class
df_pca["CREDIT_SCORE"] = y_reg
df_pca["CATEGORY"] = y_seg

df_pca.head()
Out[50]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 DEFAULT CREDIT_SCORE CATEGORY
0 -3.879585 3.634852 -4.278596 -1.702409 -0.328213 -2.611159 1.394502 -1.163674 1.294963 1.602399 1 444 Poor
1 -4.586699 -1.677113 -2.369564 -0.077723 -1.543576 1.377409 0.259867 1.753823 1.131481 -0.182091 0 625 Good
2 10.270231 -2.368764 -2.994109 -0.493575 -2.943640 0.305558 0.860887 0.351740 -1.246965 1.675060 1 469 Poor
3 26.529054 -2.038775 -1.483680 1.014450 0.699369 -1.406972 1.236112 1.057955 1.737288 2.145868 0 559 Fair
4 -4.534533 -2.926804 -3.447148 1.402831 1.338662 -1.189801 -0.270045 0.660445 -2.096263 -1.835176 0 473 Poor

Can we use the PCA data to segment customers by financial behavior?¶

In [51]:
from sklearn.cluster import KMeans

# Use PC1–PC10 for clustering
X_cluster = df_pca[[f'PC{i+1}' for i in range(10)]]

kmeans = KMeans(n_clusters=4, random_state=42)
clusters = kmeans.fit_predict(X_cluster)

# add cluster labels
df_pca['Cluster'] = clusters
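The choice of k = 4 above is an assumption, not derived from the data. One way to sanity-check the cluster count is to compare silhouette scores across a range of k; a sketch on synthetic blobs (standing in for the PC1–PC10 matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the PCA feature matrix: 4 well-separated blobs
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette (closer to 1) means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Running the same loop on `X_cluster` would show whether k = 4 is actually the best-supported segmentation for these customers.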
In [52]:
df_pca['Cluster'].unique()
Out[52]:
array([0, 1, 3, 2], dtype=int32)
In [53]:
df_pca.head()
Out[53]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 DEFAULT CREDIT_SCORE CATEGORY Cluster
0 -3.879585 3.634852 -4.278596 -1.702409 -0.328213 -2.611159 1.394502 -1.163674 1.294963 1.602399 1 444 Poor 0
1 -4.586699 -1.677113 -2.369564 -0.077723 -1.543576 1.377409 0.259867 1.753823 1.131481 -0.182091 0 625 Good 0
2 10.270231 -2.368764 -2.994109 -0.493575 -2.943640 0.305558 0.860887 0.351740 -1.246965 1.675060 1 469 Poor 0
3 26.529054 -2.038775 -1.483680 1.014450 0.699369 -1.406972 1.236112 1.057955 1.737288 2.145868 0 559 Fair 0
4 -4.534533 -2.926804 -3.447148 1.402831 1.338662 -1.189801 -0.270045 0.660445 -2.096263 -1.835176 0 473 Poor 0

Compute PCA Component means per cluster¶

In [55]:
cluster_profile = df_pca.groupby('Cluster')[ [f'PC{i+1}' for i in range(10)] ].mean().round(2)
cluster_profile
Out[55]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Cluster
0 -0.13 0.22 -2.39 1.30 1.00 0.52 0.05 0.29 0.24 -0.01
1 77.94 -1.26 2.33 -1.18 -0.05 0.29 -0.63 0.34 -0.47 0.08
2 -5.97 2.75 33.69 -2.30 4.10 2.96 0.57 -0.11 -0.40 0.81
3 -4.55 -0.56 2.40 -2.45 -2.36 -1.31 -0.11 -0.61 -0.44 -0.03

Cluster 0: Financially Stable, Low-Risk Behavior

  • PC4 = 1.30 ← moderate long-term (education + housing) debt
  • PC5 = 1.00 ← moderate essential expenses
  • PC6 = 0.52 ← slight savings weakness
  • PC2 = 0.22 ← slight education-related pressure

Cluster 1: Extreme Gambling & Financial Risk Behavior

  • PC1 = 77.94 ← extremely high gambling-to-savings intensity

Cluster 2: High Debt Load & High Living Expense Pressure

  • PC3 = 33.69 ← extremely high
  • PC5 = 4.10 ← high health/utility/education burden
  • PC6 = 2.96 ← moderate savings inefficiency
  • PC2 = 2.75 ← education pressure

Cluster 3: Moderate Long-Term Debt, Moderate Essential Spending

  • PC1 = –4.55 ← low gambling behavior
  • PC3 = 2.40 ← mildly elevated debt ratios, far below Cluster 2
  • PC4 = –2.45 ← low long-term debt burden
  • PC5 = –2.36 ← low essential spending pressure
  • PC6 = –1.31 ← efficient savings